{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fc2826d9",
   "metadata": {},
   "source": [
    "# Data Profiler - What's in your data?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b997522b",
   "metadata": {},
   "source": [
    "This introductory jupyter notebook demonstrates the basic usages of the Data Profiler. The library is designed to easily detect sensitive data and gather statistics on your datasets with just several lines of code. The Data Profiler can handle several different data types including: CSV (or any delimited file), JSON, Parquet, AVRO, and text. Additionally, there are a plethora of options to customize your profile. This library also has the ability to update profiles from multiple batches of large datasets, or merge multiple profiles. In particular, this example covers the followings:\n",
    "\n",
    "- Basic usage of the Data Profiler\n",
    "- The data reader class\n",
    "- Profiler options\n",
    "- Updating profiles and merging profiles\n",
    "\n",
    "First, let's import the libraries needed for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef404c84",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import json\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "sys.path.insert(0, '..')\n",
    "import dataprofiler as dp\n",
    "\n",
    "data_path = \"../dataprofiler/tests/data\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f51971e3",
   "metadata": {},
   "source": [
    "## Basic Usage of the Data Profiler"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "639e66d3",
   "metadata": {},
   "source": [
    "This section shows the basic example of the Data Profiler. A CSV dataset is read using the data reader, then the Data object is given to the Data Profiler to detect sensitive data and obtain the statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5379c45c",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# use data reader to read input data\n",
    "data = dp.Data(os.path.join(data_path, \"csv/aws_honeypot_marx_geo.csv\"))\n",
    "print(data.data.head())\n",
    "\n",
    "# run data profiler and get the report\n",
    "profile = dp.Profiler(data)\n",
    "report  = profile.report(report_options={\"output_format\":\"compact\"})\n",
    "\n",
    "# print the report\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57fe2827",
   "metadata": {},
   "source": [
    "The report includes `global_stats` and `data_stats` for the given dataset. The former contains overall properties of the data such as number of rows/columns, null ratio, duplicate ratio, while the latter contains specific properties and statistics for each column such as detected data label, min, max, mean, variance, etc. In this example, the `compact` format of the report is used to shorten the full list of the results. To get more results related to detailed predictions at the entity level from the Data Labeler component or histogram results, the format `pretty` should be used."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74027cfd",
   "metadata": {},
   "source": [
    "## Data reader class"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41364888",
   "metadata": {},
   "source": [
    "DataProfiler can detect multiple file types including CSV (or any delimited file), JSON, Parquet, AVRO, and text. The example below shows that it successfully detects data types from multiple categories regardless of the file extensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "823829f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# use data reader to read input data with different file types\n",
    "csv_files = [\n",
    "    \"csv/aws_honeypot_marx_geo.csv\",\n",
    "    \"csv/all-strings-skip-header-author.csv\", # csv files with the author/description on the first line\n",
    "    \"csv/sparse-first-and-last-column-empty-first-row.txt\", # csv file with the .txt extension\n",
    "]\n",
    "json_files = [\n",
    "    \"json/complex_nested.json\",\n",
    "    \"json/honeypot_intentially_mislabeled_file.csv\", # json file with the .csv extension\n",
    "]\n",
    "parquet_files = [\n",
    "    \"parquet/nation.dict.parquet\",\n",
    "    \"parquet/nation.plain.intentionally_mislabled_file.csv\", # parquet file with the .csv extension\n",
    "]\n",
    "avro_files = [\n",
    "    \"avro/userdata1.avro\",\n",
    "    \"avro/userdata1_intentionally_mislabled_file.json\", # avro file with the .json extension\n",
    "]\n",
    "text_files = [\n",
    "    \"txt/discussion_reddit.txt\",\n",
    "]\n",
    "\n",
    "all_files = {\n",
    "    \"csv\": csv_files,\n",
    "    \"json\": json_files,\n",
    "    \"parquet\": parquet_files,\n",
    "    \"avro\": avro_files,\n",
    "    \"text\": text_files\n",
    "}\n",
    "\n",
    "for file_type in all_files:\n",
    "    print(file_type)\n",
    "    for file in all_files[file_type]:\n",
    "        data = dp.Data(os.path.join(data_path, file))\n",
    "        print(\"{:<85} {:<15}\".format(file, data.data_type))\n",
    "    print(\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f9d7e02",
   "metadata": {},
   "source": [
    "The `Data` class detects the file type and uses one of the following classes: `CSVData`, `JSONData`, `ParquetData`, `AVROData`, `TextData`. Users can call these specific classes directly if desired. For example, below we provide a collection of data with different types, each of them is processed by the corresponding data class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "831e68a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# use individual data reader classes\n",
    "from dataprofiler.data_readers.csv_data import CSVData\n",
    "from dataprofiler.data_readers.json_data import JSONData\n",
    "from dataprofiler.data_readers.parquet_data import ParquetData\n",
    "from dataprofiler.data_readers.avro_data import AVROData\n",
    "from dataprofiler.data_readers.text_data import TextData\n",
    "\n",
    "csv_files = \"csv/aws_honeypot_marx_geo.csv\"\n",
    "json_files = \"json/complex_nested.json\"\n",
    "parquet_files = \"parquet/nation.dict.parquet\"\n",
    "avro_files = \"avro/userdata1.avro\"\n",
    "text_files = \"txt/discussion_reddit.txt\"\n",
    "\n",
    "all_files = {\n",
    "    \"csv\": [csv_files, CSVData],\n",
    "    \"json\": [json_files, JSONData],\n",
    "    \"parquet\": [parquet_files, ParquetData],\n",
    "    \"avro\": [avro_files, AVROData],\n",
    "    \"text\": [text_files, TextData],\n",
    "}\n",
    "\n",
    "for file_type in all_files:\n",
    "    file, data_reader = all_files[file_type]\n",
    "    data = data_reader(os.path.join(data_path, file))\n",
    "    print(\"File name {}\\n\".format(file))\n",
    "    if file_type == \"text\":\n",
    "        print(data.data[0][:1000]) # print the first 1000 characters\n",
    "    else:\n",
    "        print(data.data)\n",
    "    print('===============================================================================')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "572df0a8",
   "metadata": {},
   "source": [
    "In addition to reading the input data from multiple file types, the Data Profiler allows the input data as a dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df87ab83",
   "metadata": {},
   "outputs": [],
   "source": [
    "# run data profiler and get the report\n",
    "my_dataframe = pd.DataFrame([[1, 2.0],[1, 2.2],[-1, 3]], columns=[\"col_int\", \"col_float\"])\n",
    "profile = dp.Profiler(my_dataframe)\n",
    "report  = profile.report(report_options={\"output_format\":\"compact\"})\n",
    "\n",
    "# Print the report\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84a06312",
   "metadata": {},
   "source": [
    "## Structured Profiler vs. Unstructured Profiler"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c0ea925",
   "metadata": {},
   "source": [
    "The profiler will infer what type of statistics to generate (structured or unstructured) based on the input. However, you can explicitly specify profile type as well. Here is an example of the the profiler explicitly calling the structured profile and the unstructured profile."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f4565d8",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Using the structured profiler\n",
    "data = dp.Data(os.path.join(data_path, \"csv/aws_honeypot_marx_geo.csv\"))\n",
    "profile = dp.Profiler(data, profiler_type='structured')\n",
    "\n",
    "report = profile.report(report_options={\"output_format\": \"pretty\"})\n",
    "print(json.dumps(report, indent=4))\n",
    "\n",
    "# Using the unstructured profiler\n",
    "my_dataframe = pd.DataFrame([[\"Sample1\"],[\"Sample2\"],[\"Sample3\"]], columns=[\"Text_Samples\"])\n",
    "profile = dp.Profiler(my_dataframe, profiler_type='unstructured')\n",
    "\n",
    "report  = profile.report(report_options={\"output_format\":\"pretty\"})\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b16648ba",
   "metadata": {},
   "source": [
    "## Profiler options"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b0cc8ad",
   "metadata": {},
   "source": [
    "The Data Profiler can enable/disable statistics and modify features through profiler options. For example, if the users only want the statistics information, they may turn off the Data Labeler functionality. Below, let's remove the histogram and data labeler component while running Data Profiler."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bbac3a2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "profile_options = dp.ProfilerOptions()\n",
    "profile_options.set({\"histogram_and_quantiles.is_enabled\": False,\n",
    "                     \"median_abs_deviation.is_enabled\": False,\n",
    "                     \"median.is_enabled\": False,\n",
    "                     \"mode.is_enabled\": False,\n",
    "                     \"data_labeler.is_enabled\": False,})\n",
    "\n",
    "profile = dp.Profiler(my_dataframe, options=profile_options)\n",
    "report  = profile.report(report_options={\"output_format\":\"pretty\"})\n",
    "\n",
    "# Print the report\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "590ca50b",
   "metadata": {},
   "source": [
    "Besides toggling on and off features, other options like the data labeler sample size or histogram bin method can be directly set and validated as shown here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ed21bc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "profile_options = dp.ProfilerOptions()\n",
    "profile_options.structured_options.data_labeler.sample_size = 1\n",
    "profile_options.structured_options.int.histogram_and_quantiles.bin_count_or_method = \"rice\"\n",
    "# An error will raise if the options are set incorrectly.\n",
    "profile_options.validate()\n",
    "\n",
    "profile = dp.Profiler(my_dataframe, options=profile_options)\n",
    "report  = profile.report(report_options={\"output_format\":\"pretty\"})\n",
    "\n",
    "# Print the report\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f690616",
   "metadata": {},
   "source": [
    "## Update profiles"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "965f8c85",
   "metadata": {},
   "source": [
    "One of the interesting features of the Data Profiler is the ability to update profiles from batches of data, which allows for data streaming usage. In this section, the original dataset is separated into two batches with equal size. Each batch is then updated with Data Profiler sequentially.  \n",
    "\n",
    "After the update, we expect the resulted profiles give the same statistics as the profiles updated from the full dataset. We will verify that through some properties in `global_stats` of the profiles including `column_count`, `row_count`, `row_is_null_ratio`, `duplicate_row_count`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34ac4346",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# read the input data and devide it into two equal halves\n",
    "data = dp.Data(os.path.join(data_path, \"csv/aws_honeypot_marx_geo.csv\"))\n",
    "df = data.data\n",
    "df1 = df.iloc[:int(len(df)/2)]\n",
    "df2 = df.iloc[int(len(df)/2):]\n",
    "\n",
    "# Update the profile with the first half\n",
    "profile = dp.Profiler(df1)\n",
    "\n",
    "# Update the profile with the second half\n",
    "profile.update_profile(df2)\n",
    "\n",
    "# Update profile with the full dataset\n",
    "profile_full = dp.Profiler(df)\n",
    "\n",
    "report  = profile.report(report_options={\"output_format\":\"compact\"})\n",
    "report_full  = profile_full.report(report_options={\"output_format\":\"compact\"})\n",
    "\n",
    "# print the report\n",
    "print(json.dumps(report, indent=4))\n",
    "print(json.dumps(report_full, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b41ee2bf",
   "metadata": {},
   "source": [
    "You can see that the profiles are exactly the same whether they are broken into several updates or not."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c547f051",
   "metadata": {},
   "source": [
    "## Merge profiles"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5292962",
   "metadata": {},
   "source": [
    "In addition to the profile update, Data Profiler provides the merging functionality which allows users to combine the profiles updated from multiple locations. This enables Data Profiler to be used in a distributed computing environment. Below, we assume that the two aforementioned halves of the original dataset come from two different machines. Each of them is then updated with the Data Profiler on the same machine, then the resulted profiles are merged.\n",
    "\n",
    "As with the profile update, we expect the merged profiles give the same statistics as the profiles updated from the full dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a565b8d1",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Update the profile with the first half\n",
    "profile1 = dp.Profiler(df1)\n",
    "\n",
    "# Update the profile with the second half\n",
    "profile2 = dp.Profiler(df2)\n",
    "\n",
    "# merge profiles\n",
    "profile_merge = profile1 + profile2\n",
    "\n",
    "# check results of the merged profile\n",
    "report_merge  = profile.report(report_options={\"output_format\":\"compact\"})\n",
    "\n",
    "# print the report\n",
    "print(json.dumps(report_merge, indent=4))\n",
    "print(json.dumps(report_full, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b77fac3f",
   "metadata": {},
   "source": [
    "You can see that the profiles are exactly the same!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c644ee42",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "We have walked through some basic examples of Data Profiler usage, with different input data types and profiling options. We also work with update and merging functionality of the Data Profiler, which make it applicable for data streaming and distributed environment. Interested users can try with different datasets and functionalities as desired."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
